ESSNet on Statistical Disclosure Control

Task 7. Synthetic data files

Publication of synthetic (i.e., simulated) data is an alternative to masking original data when protecting data against disclosure. The idea is to randomly generate data under the constraint that certain statistics or internal relationships of the original dataset are preserved.
Rubin (1993) proposed a new approach to guaranteeing confidentiality: generating fully synthetic data sets. His idea was to treat all the observations from the sampling frame that are not part of the sample as missing data and to impute them according to the multiple imputation framework. Afterwards, several simple random samples from these fully imputed data sets are released to the public. Because all imputed values are random draws from the posterior predictive distribution of the missing values given the observed values, disclosure of sensitive information is nearly impossible, especially if the released data sets do not contain any real data. Another advantage of this approach concerns the sampling design of the imputed data sets. As the released data sets can be simple random samples from the population, the analyst does not have to allow for a complex sampling design in the models. With this approach the sampling weights can be completely removed from the released data; this is a further advantage, as these weights often carry disclosive information, especially for surveys on enterprises (where the inclusion probability is usually proportional to size).
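To fix ideas, the following toy sketch illustrates the fully synthetic approach on a single numeric variable. The normal model on the log scale, the sample sizes and all variable names are illustrative assumptions made for this sketch, not part of Rubin's proposal.

import numpy as np

rng = np.random.default_rng(42)

N, n, m = 10_000, 500, 5   # population size, sample size, number of imputations
population = rng.lognormal(mean=10, sigma=1, size=N)   # toy "turnover" variable
sampled = rng.choice(N, size=n, replace=False)
y = np.log(population[sampled])                        # observed sample (log scale)

released = []
for _ in range(m):
    # Posterior draws under a normal model with a noninformative prior:
    # sigma^2 | y ~ (n-1) s^2 / chi^2_{n-1},  mu | sigma^2, y ~ N(ybar, sigma^2/n)
    sigma2 = (n - 1) * y.var(ddof=1) / rng.chisquare(n - 1)
    mu = rng.normal(y.mean(), np.sqrt(sigma2 / n))
    # Impute all non-sampled units from the posterior predictive distribution
    completed = np.exp(rng.normal(mu, np.sqrt(sigma2), size=N))
    completed[sampled] = population[sampled]           # sampled units stay observed
    # Release a simple random sample from each completed population
    released.append(rng.choice(completed, size=n, replace=False))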
Rubin’s proposal was developed more fully in Raghunathan, Reiter, and Rubin (2003), and a simulation study of it is given in Reiter (2002). Inference on synthetic data is discussed in Reiter (2005a), which also gives an empirical illustration, while Reiter (2005b) develops significance tests for multi-component estimands.
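For reference, inference from $m$ fully synthetic data sets uses combining rules of the following form (as derived in Raghunathan, Reiter and Rubin, 2003), where $q^{(i)}$ and $u^{(i)}$ denote the point estimate and its variance estimate computed on the $i$-th released data set:

\[
\bar{q}_m = \frac{1}{m}\sum_{i=1}^{m} q^{(i)}, \qquad
b_m = \frac{1}{m-1}\sum_{i=1}^{m}\bigl(q^{(i)}-\bar{q}_m\bigr)^2, \qquad
\bar{u}_m = \frac{1}{m}\sum_{i=1}^{m} u^{(i)},
\]
\[
T_s = \Bigl(1+\frac{1}{m}\Bigr) b_m - \bar{u}_m .
\]

The point estimate is $\bar{q}_m$ with estimated variance $T_s$; since $T_s$ can be negative in small samples, the cited papers discuss suitable adjustments.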
A further application of the multiple imputation philosophy is found in An and Little (2007), where selected records showing high values of a target variable are imputed, thus providing an alternative to top-coding.
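As a rough illustration of the idea (not An and Little's exact procedure, which is based on multiple imputation), one could replace values above a cutoff with draws from a tail model fitted to the observed top values. The Pareto tail and all names below are assumptions of this toy sketch.

import numpy as np

rng = np.random.default_rng(0)
income = rng.lognormal(mean=10, sigma=1, size=5_000)
cutoff = np.quantile(income, 0.99)
top = income[income > cutoff]

# Maximum-likelihood Pareto shape parameter for the exceedances over the cutoff
alpha = len(top) / np.log(top / cutoff).sum()

# Instead of top-coding at the cutoff, draw replacement values from the fitted tail
protected = income.copy()
protected[income > cutoff] = cutoff * (1.0 + rng.pareto(alpha, size=len(top)))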
However, the quality of this method strongly depends on the accuracy of the model used to impute the "missing" values. If the model does not include all the relationships between the variables that are of interest to the analyst, or if the joint distribution of the variables is mis-specified, results from the synthetic data sets can be biased. Furthermore, specifying a model that accounts for all the skip patterns and constraints between the variables can be cumbersome, if not impossible.
To overcome these problems, a related approach suggested by Little (1993) replaces observed values with imputed values only for variables that bear a high risk of disclosure or that contain especially sensitive information, leaving the rest of the data unchanged. This approach, discussed in the literature as generating partially synthetic data sets (see also Reiter, 2003), has been adopted for some data sets in the US (see, for example, Abowd and Woodcock, 2001, 2004, Abowd, Stinson and Benedetto, 2006, or Kennickell, 1997).
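A minimal sketch of partial synthesis in the spirit of Little (1993) follows, assuming a single sensitive variable and a simple linear model; both choices, and all variable names, are illustrative.

import numpy as np

rng = np.random.default_rng(1)
n = 2_000
x = rng.normal(size=n)                                  # quasi-identifier, kept as observed
wage = 2.0 + 1.5 * x + rng.normal(scale=0.5, size=n)    # sensitive variable

# Fit wage | x, then replace every wage with a draw from the fitted model
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, wage, rcond=None)
resid_sd = (wage - X @ beta).std(ddof=2)

synthetic_wage = X @ beta + rng.normal(scale=resid_sd, size=n)
released = np.column_stack([x, synthetic_wage])         # x stays real, wage is synthetic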
To address the problem of model misspecification, another active area of research has been the formulation of nonparametric statistical models. Reiter (2005c) proposes the use of classification and regression trees (CART) to estimate nonparametrically the distribution that generates the synthetic data. Franconi and Polettini (2007) have investigated the use of Bayesian networks to generate synthetic data allowing for logical constraints among categorical variables. Methods for generating partially synthetic data that maintain specific statistics on certain sub-domains have also been proposed by Polettini (2003), within a semiparametric framework. Both of the latter methods have been implemented on enterprise microdata.
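The CART-based approach can be sketched as follows: grow a regression tree for the sensitive variable, then replace each value with a donor drawn from the observed values in its leaf. Reiter (2005c) samples within leaves via a Bayesian bootstrap; the plain bootstrap and the simulated data below are simplifying assumptions of this sketch.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
n = 2_000
X = rng.normal(size=(n, 3))                                  # non-sensitive predictors
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.3, size=n)   # sensitive variable

tree = DecisionTreeRegressor(min_samples_leaf=25).fit(X, y)
leaf_of = tree.apply(X)                                      # leaf index of each record

synthetic_y = np.empty(n)
for leaf in np.unique(leaf_of):
    members = np.flatnonzero(leaf_of == leaf)
    # Draw each synthetic value from the observed values in the same leaf
    synthetic_y[members] = rng.choice(y[members], size=len(members), replace=True)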
Drechsler et al. (2006) and Drechsler, Bender and Rässler (2007) describe an application to German data, generating fully synthetic and partially synthetic data sets from a panel of establishments. Their results for a cross-section are very promising.
A criticism often made of synthetic data is that they only preserve the relationships accounted for in the model used to create them. That is, for data uses not anticipated by the data protector, such as subdomain analysis, completely wrong results may be obtained. This is not so for SDC methods based on masking. Recently, an interesting attempt at combining the advantages of synthetic data (specifically the Burridge (2003) method) and masking has been made by Muralidhar and Sarathy (2007).
In the last few years, the use of synthetic data has attracted much more attention as an alternative for microdata protection. The development and application of synthetic data has so far been concentrated mainly in the USA; it should be investigated whether this approach could be useful for Europe as well. In all of the previously mentioned applications, further studies are needed to assess the large-scale implementability of these methods, to study practical issues related to such implementation, namely computational burden and time, and to check the data utility of the resulting synthetic data, especially as concerns the goodness of fit of the model and its ability to reproduce, to a large extent, the observed relationships over subdomains. These aspects also have to be carefully evaluated with a view to allowing access to synthetic data, a setting where the analyst's model is unknown and possibly involves specific subpopulations.
The methods cited in the paragraphs above will be analyzed, case studies (e.g. on a sample of enterprises stemming from an Italian survey) will be developed, and an overview report with some recommendations will be produced. If useful, we will try to incorporate methods for generating synthetic data into µ-ARGUS in the second year.
Partners: URV, AT and DE for the report; IT also for an example of synthetic data on a specific survey; NL for the implementation in µ-ARGUS; DE and IT for testing the feasibility of these methods.
Deliverables: report after year 1 and a possible implementation at the end of year 2.
References:
J.M. Abowd, M. Stinson and G. Benedetto (2006). Final Report to the Social Security Administration on the SIPP/SSA/IRS Public Use File Project, mimeo, Washington.
J.M. Abowd and S.D. Woodcock (2001). Disclosure limitation in longitudinal linked data. In: Confidentiality, Disclosure, and Data Access: Theory and Practical Applications for Statistical Agencies. North-Holland, Amsterdam, 215-277.
J.M. Abowd and S.D. Woodcock (2004). Multiply-Imputing Confidential Characteristics and File Links in Longitudinal Linked Data. In: Privacy in Statistical Databases. Springer, New York, 290-297.
D. An and R.J.A. Little (2007). Multiple imputation: an alternative to top coding for statistical disclosure control. Journal of the Royal Statistical Society, Series A, 170(4): 923-940.
J. Burridge (2003). Information preserving statistical obfuscation. Statistics and Computing, 13: 321-327.
J. Drechsler, A. Dundler, S. Bender, S. Rässler and T. Zwick (2006). A New Approach for Disclosure Control in the IAB Establishment Panel - Multiple Imputation for a Better Data Access. UNECE Work Session on Statistical Data Editing (Bonn, Germany, 25-27 September 2006), invited paper.
J. Drechsler, S. Bender and S. Rässler (2007). Comparing Fully and Partially Synthetic Data Sets for Statistical Disclosure Control in the German IAB Establishment Panel, mimeo, Nuremberg.
L. Franconi and S. Polettini (2007). Some experiences at Istat on data simulation. Proceedings of ISI Conference, Lisbon, 23-29 August, 2007.
A.B. Kennickell (1997). Multiple imputation and disclosure protection: the case of the 1995 Survey of Consumer Finances. In: Record Linkage Techniques. National Academy Press, Washington D.C., 248-267.
R.J.A. Little (1993). Statistical Analysis of Masked Data. Journal of Official Statistics, 9: 407-426.
K. Muralidhar and R. Sarathy (2007). Generating sufficiency-based non-synthetic perturbed data. IEEE Transactions on Knowledge and Data Engineering (to appear).
S. Polettini (2003). Maximum entropy simulation for microdata protection, Statistics and Computing, 13(4), 307-320.
T.E. Raghunathan, J.P. Reiter and D.B. Rubin (2003). Multiple imputation for statistical disclosure limitation. Journal of Official Statistics, 19(1): 1-16.
J.P. Reiter (2002). Satisfying disclosure restrictions with synthetic data sets. Journal of Official Statistics, 18(4): 531-544.
J.P. Reiter (2003). Inference for partially synthetic, public use microdata sets. Survey Methodology, 29: 181-188.
J.P. Reiter (2005a). Releasing multiply-imputed, synthetic public use microdata: an illustration and empirical study. Journal of the Royal Statistical Society, Series A, 168: 185-205.
J.P. Reiter (2005b). Significance tests for multi-component estimands from multiply-imputed, synthetic microdata. Journal of Statistical Planning and Inference, 131(2): 365-377.
J.P. Reiter (2005c). Using CART to generate partially synthetic public use microdata. Journal of Official Statistics, 21: 441-462.
D.B. Rubin (1993). Discussion: statistical disclosure limitation. Journal of Official Statistics, 9(2): 461-468.